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Abstract.  This  paper  considers  block  multi-convex  optimization,  where  the  feasible  set  and  objective  function  are  generally 
non-convex  but  convex  in  each  block  of  variables.  We  review  some  of  its  interesting  examples  and  propose  a  generalized  block 
coordinate  descent  method.  Under  certain  conditions,  we  show  that  any  limit  point  satisfies  the  Nash  equilibrium  conditions. 
Furthermore,  we  establish  its  global  convergence  and  estimate  its  asymptotic  convergence  rate  by  assuming  a  property  based  on 
the  Kurdyka-Lojasiewicz  inequality.  The  proposed  algorithms  are  adapted  for  factorizing  nonnegative  matrices  and  tensors,  as 
well  as  completing  them  from  their  incomplete  observations.  The  algorithms  were  tested  on  synthetic  data,  hyperspectral  data, 
as  well  as  image  sets  from  the  CBCL  and  ORL  databases.  Compared  to  the  existing  state-of-the-art  algorithms,  the  proposed 
algorithms  demonstrate  superior  performance  in  both  speed  and  solution  quality.  The  Matlab  code  is  available  for  download 
from  the  authors’  homepages. 
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1.  Introduction.  In  this  paper,  we  consider  the  optimization  problem 

S 

minF(xi,---  ,xs)  = /(x i,---  ,xs)  +  ^r^x,),  fi-1) 

i= 1 

where  variable  x  is  decomposed  into  s  blocks  x1;  •  •  •  ,  xs,  the  set  X  of  feasible  points  is  assumed  to  be  a  closed 
and  block  multi-convex  subset  of  R",  /  is  assumed  to  be  a  differentiable  and  block  multi-convex  function, 
and  Ti,  i  =  l,---  ,  s,  are  extended-value  convex  functions.  Set  X  and  function  /  can  be  non-convex  over 
X=  (xi,---  ,xs). 

We  call  a  set  X  block  multi-convex  if  its  projection  to  each  block  of  variables  is  convex,  namely,  for  each 
i  and  fixed  (s  —  1)  blocks  x1;  •  •  •  ,  xi_1)  Xj+1,  ■  •  •  ,  xs,  the  set 

Xi(xi,  ■  ■  ■  ,  Xj_i,  Xj_j_i,  ,xs)  =  {xi  £  Rni  :  (xi,---  ,  Xj_i,  xi;  xi+1,  ■  ■  •  ,xs)  £  Xj 


is  convex.  We  call  a  function  /  is  block  multi-convex  if  for  each  i,  f  is  a  convex  function  of  x,(  while  all  the 
other  blocks  are  fixed.  Therefore,  when  all  but  one  blocks  are  fixed,  (1.1)  over  the  free  block  is  a  convex 
problem. 

Extended  value  means  rj(xj)  =  oo  if  x*  ^  dom(ri),  i  =  1,  •  •  •  ,  s.  In  particular,  r,  (or  a  part  of  it)  can 
be  indicator  functions  of  convex  sets.  We  use  x  £  X  to  model  joint  constraints  and  ri, . . .  ,rs  to  include 
individual  constraints  of  x1;  •  •  •  ,xs,  when  they  are  present.  In  addition,  r*  can  include  nonsmooth  functions. 

Our  main  interest  is  the  block  coordinate  descent  (BCD)  method  of  the  Gauss-Seidel  type,  which  mini¬ 
mizes  F  cyclically  over  each  of  xi,  ■  •  •  ,  xs  while  fixing  the  remaining  blocks  at  their  last  updated  values.  Let 
x(:  denote  the  value  of  x,  after  its  /eth  update,  and 


=  /(> 


k-1 


rk- 1 


),  for  all  i  and  k. 
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At  each  step,  we  consider  three  different  updates 


where 


Original : 

x^  =  argmin/f  (Xi)  +  r^x*), 

(1.2a) 

Xi  G  xk 

Proximal : 

x(=  =  argmin/f(xi)  +  1  ||xi  x^  1||2  +  ri(xi), 

(1.2b) 

Prox-linear : 

x£  =  argmin  (g?,Xi  x^  :)  +  1  ||x,  x^  1 1|2  +  r^x*), 

*i£Xk  Z 

(1.2c) 

denotes  the  norm,  Lht  1  >  0, 


I  5 


ck  xfc_1  ■ 
S— l?  -■S+i  > 


k- 1\ 


and  in  the  last  type  of  update  (1.2c), 


x?-1  =  x?-1 


K  - 


„k  —  2\ 


(1.3) 


denotes  an  extrapolated  point,  >  0  is  the  extrapolation  weight,  gf  =  V/f  (x^_1)  is  the  block-partial 
gradient  of  /  at  x^-1.  We  consider  extrapolation  (1.3)  for  update  (1.2c)  since  it  significantly  accelerates  the 
convergence  of  BCD  in  our  applications.  The  framework  of  BCD  is  given  in  Alg.  1,  which  allows  each  to 
be  updated  by  (1.2a),  (1.2b),  or  (1.2c). 


Algorithm  1  Block  coordinate  descent  method  for  solving  (1.1) 
Initialization:  choose  initial  two  points  (xj"1,  •  ■  ■  ,  x^1)  =  (x?,  ■  •  ■  ,  x°) 
for  k  =  1,  2,  •  -  •  do 
for  i  =  1,  2,  •  •  •  ,  s  do 

xf  <—  (1.2a),  (1.2b),  or  (1.2c). 
end  for 

if  stopping  criterion  is  satisfied  then 
return  (xj,  ■  •  ■  ,  xj). 

end  if 
end  for 


Since  X  and  /  are  block  multi-convex,  all  three  subproblems  in  (1.2)  are  convex.  In  general,  the  three 
updates  generate  different  sequences  and  can  thus  cause  BCD  to  converge  to  different  solutions.  We  found 
in  many  tests,  applying  (1.2c)  on  all  or  some  blocks  give  solutions  of  lower  objective  values,  for  a  possible 
reason  that  its  local  prox-linear  approximation  help  avoid  the  small  regions  around  certain  local  minima.  In 
addition,  it  is  generally  more  time  consuming  to  compute  (1.2a)  and  (1.2b)  than  (1.2c)  though  each  time  the 
former  two  tend  to  make  larger  objective  decreases  than  applying  (1.2c)  without  extrapolation.  We  consider 
all  of  the  three  updates  since  they  fit  different  applications,  and  also  different  blocks  in  the  same  application, 
yet  their  convergence  can  be  analyzes  in  a  unified  framework. 

Original  subproblem  (1.2a)  is  the  most-used  form  in  BCD  and  has  been  extensively  studied.  It  dates 
back  to  methods  in  [52]  for  solving  equation  systems  and  to  works  [5,24,61,70],  which  analyze  the  method 
assuming  F  to  be  convex  (or  quasiconvex  or  hemivariate),  differentiable,  and  have  bounded  level  sets  except 
for  certain  classes  of  convex  functions.  When  F  is  non-convex,  BCD  may  cycle  and  stagnate  [56].  However, 
subsequence  convergence  can  be  obtained  for  special  cases  such  as  quadratic  function  [48],  strictly  pseudo¬ 
convexity  in  each  of  (s  —  2)  blocks  [22],  unique  minimizer  per  block  [47],  p.195.  If  F  is  non-differentiable,  BCD 
can  get  stuck  at  a  non-stationary  point;  see  [5]  p.94.  However,  subsequence  convergence  can  be  obtained  if 


2 


the  non-differentiable  part  is  separable;  see  works  [23, 50, 65, 66]  for  results  on  different  forms  of  F.  In  our 
objective  function,  /  is  differentiable  and  possibly  non-convex,  and  the  nonsmooth  part  is  block-separable 
functions  r,; . 

Proximal  subproblem  (1.2b)  has  been  used  with  BCD  in  [22].  For  X  =  Rn,  their  work  shows  that  every 
limit  point  is  a  critical  point.  Recently,  this  method  is  revisited  in  [4]  for  only  two  blocks  and  shown  to 
converge  globally  via  the  Kurdyka-Lojasiewicz  (KL)  inequality. 

Prox-linear  subproblem  (1.2c)  with  extrapolation  is  new  but  very  similar  to  the  update  in  the  block- 
coordinate  gradient  descent  (BGD)  method  of  [67],  which  identifies  a  block  descent  direction  by  gradient 
projection  and  then  performs  an  Armijo-type  line  search.  [67]  does  not  use  extrapolation  (1.3).  Their  work 
considers  more  general  /  which  is  smooth  but  not  necessarily  multi-convex,  but  it  does  not  consider  joint 
constraints.  While  we  are  preparing  the  paper,  [57]  provides  a  unified  convergence  analysis  of  coordinatewise 
successive  minimization  methods  for  nonsmooth  nonconvex  optimization.  Those  methods  update  block 
variables  by  minimizing  a  surrogate  function  that  dominates  the  original  objective  around  the  current  iterate. 
They  do  not  use  extrapolation  either  and  only  have  subsequence  convergence. 

There  are  examples  of  r*  that  make  (1.2c)  easier  to  compute  than  (1.2a)  and  (1.2b).  For  instance, 
if  r,  =  Sx>i  the  indicator  function  of  convex  set  T>i  (equivalent  to  x,  e  'Df),  (1.2c)  reduces  to  = 
VXknDi  (xjfc_1  -  g f-1/L,f-1)  ,  where  VXknD.  is  the  project  to  set  XfnDi.  If  rj(xj)  =  Aj||xi||i  and  X-f  = 
(1.2c)  reduces  to  x^  =  SLk- 1  ,A.  (x^-1  —  g f_1/T.f_1)  ,  where  <S„(-)  is  soft-thresholding  defined  component¬ 
wise  as  Su(t)  =  sign(t)  max(|t|  —  v,  0).  More  examples  arise  in  joint/group  t\  and  nuclear  norm  minimization, 
total  variation,  etc. 

1.1.  Contributions.  We  propose  Alg.  1  and  establish  its  global  convergence.  The  algorithm  is  ap¬ 
plied  to  two  classes  problems  (i)  nonnegative  matrix/tensor  factorization  and  (ii)  nonnegative  matrix/tensor 
completion  from  incomplete  observations,  and  is  demonstrated  superior  than  the  state-of-the-arts  on  both 
synthetic  and  real  data  in  both  speed  and  solution  quality. 

Our  convergence  analysis  takes  two  steps.  Under  certain  assumptions,  the  first  step  establishes  the  square 
summable  result  Y^k  ||xfc  —  xfc+1||2  <  oo  and  obtains  subsequence  convergence  to  Nash  equilibrium  points,  as 
well  as  global  convergence  to  a  single  Nash  point  if  the  sequence  is  bounded  and  the  Nash  points  are  isolated. 
The  second  step  assumes  the  KL  inequality  [13,14]  and  improves  the  result  to  llxfc  —  xfc+1||  <  oo,  which 
gives  the  algorithm  global  convergence,  as  well  as  asymptotic  rates  of  convergence.  The  classes  of  functions 
that  obey  the  KL  inequality  are  reviewed.  Despite  the  popularity  of  BCD,  very  few  works  establish  global 
convergence  without  the  (quasi)convexity  assumption  on  F;  works  [48, 67]  have  obtained  global  convergence 
by  assuming  a  local  Lipschitzian  error  bound  and  the  isolation  of  the  isocost  surfaces  of  F.  Some  very 
interesting  problems  satisfy  their  assumptions.  Their  and  our  assumptions  do  not  contain  each  other,  though 
there  are  problems  satisfying  both. 

1.2.  Applications.  A  large  number  of  practical  problems  can  be  formulated  in  the  form  of  (1.1)  such 
as  convex  problems:  (group)  Lasso  [64,74]  or  the  basis  pursuit  (denoising)  [15],  low-rank  matrix  recovery  [58], 
hybrid  huberized  support  vector  machine  [69],  and  so  on.  We  give  some  non-convex  examples  as  follows. 

Blind  source  separation  and  sparse  dictionary  learning.  Let  s i  =  1  ,■■■  ,p,  be  the  source 
signals.  Given  p  observed  signals  x,;  =  As,  +  rji,  where  A  G  Cmxn  is  an  unknown  mixing  matrix  and  r\i  is 
noise,  i  =  1,  •  •  •  ,p,  blind  source  separation  (BSS)  [27]  aims  to  estimate  both  A  and  si,  •  •  •  ,sm.  It  has  found 
applications  in  many  areas  such  as  artifact  removal  [26]  and  image  processing  [28].  Two  classical  approaches 
for  BBS  are  principle  component  analysis  (PCA)  [62]  and  independent  component  analysis  (ICA)  [18].  If 
m  <  n  and  no  prior  information  on  A  and  si,  •  •  •  , sm  is  given,  these  methods  will  fail.  Assuming  si,  •  •  •  , srn 
are  sparse  under  some  dictionary  B,  namely,  [si,  -  -  -  ,sp]  =  BY  and  Y  is  sparse,  [12,78]  apply  the  sparse 
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BSS  model 


min  ^||ABY  —  X|||.  +  r(Y),  subject  to  A  £  V  (1.4) 

where  r(Y)  is  a  regularizer  such  as  r( Y)  =  ||Y||i  =  JT  ■  \yij\,  V  is  a,  convex  set  to  control  the  scale  of  A 
such  as  ||A||i?  <  1,  and  A  is  a  balancing  parameter.  Note  that  model  (1.4)  is  block  multi-convex  in  A  and  Y 
each  but  jointly  non-convex.  A  similar  model  appears  in  cosmic  microwave  background  analysis  [10]  which 
solves 

min  ^ trace  ((ABY  -  X)TC_1(ABY  -  X))  +  r(Y),  subject  to  A  e  V  (1.5) 

for  a  certain  covariance  matrix  C.  Algorithms  for  (sparse)  BSS  include  online  learning  algorithm  [2,49], 
feature  extraction  method  [43],  feature  sign  algorithm  [40],  and  so  on. 

Model  (1.4)  with  B  =  I  also  arises  in  sparse  dictionary  training  [1,49],  where  the  goal  is  to  build  a 
dictionary  A  that  sparsely  represented  the  signals  in  X. 

Nonnegative  matrix  factorization.  Nonnegative  matrix  factorization  (NMF)  was  first  proposed  by 
Paatero  and  his  coworkers  in  the  area  of  environmental  science  [53].  The  later  popularity  of  NMF  can  be 
partially  attributed  to  the  publication  of  [38]  in  Nature.  It  has  been  widely  applied  in  data  mining  such  as 
text  mining  [55]  and  image  mining  [41],  dimension  reduction  and  clustering  [16,72],  hyperspectral  endmember 
extraction,  as  well  as  spectral  data  analysis  [54],  A  widely  used  model  for  (regularized)  NMF  is 


min 

X>0,Y>0 


M|||  +  ar1(X)  +  /3r2(Y), 


(1.6) 


where  M  is  the  input  nonnegative  matrix,  r1;  r2  are  some  regularizers  promoting  solution  structures,  and  a,  (3 
are  weight  parameters.  Two  early  popular  algorithms  for  NMF  are  the  projected  alternating  least  squares 
method  [53]  and  multiplicative  updating  method  [39].  Due  to  the  bi-convexity  of  the  objective  in  (1.6),  a 
series  of  alternating  nonnegative  least  square  (ANLS)  methods  have  been  proposed  such  as  [30,32,42];  they 
are  BCDs  with  update  (1.2a).  Recently,  the  classic  alternating  direction  method  (ADM)  has  been  applied 
in  [77].  We  compare  the  proposed  algorithms  to  them  in  Sec.  4  below. 

Nonnegative  tensor  factorization.  Nonnegative  tensor  factorization  (NTF)  is  a  generalization  of 
NMF  to  multi-dimensional  arrays.  The  initially  used  model  for  NTF  was  based  on  CANDECOMP/PARAFAC 
tensor  decomposition  [71] 


min  -||M  -  Ai  o  A2  o  •  •  •  o  Ajv||f; 

Ai ,•••  ,Aiv>0  z 


(1.7) 


and  then  it  was  proposed  based  on  Tucker  decomposition  [34] 

.  1I1'11.  x||A4  —  Q  Xi  Ai  x2  A2  •  •  •  xjv  Ajv||f- 

C/,Ai,---  .Ajv^O  Z 


(1.8) 


where  A4  is  a  given  nonnegative  tensor,  and  “o”  and  “xra”  represent  outer  product  and  tensor-matrix 
multiplication,  respectively.  (The  necessary  background  of  tensor  is  reviewed  in  Sec.  3)  Most  algorithms  for 
solving  NMF  have  been  directly  extended  to  NTF.  For  example,  the  multiplicative  update  in  [53]  is  extended 
to  solving  (1.7)  in  [63].  The  ANLS  methods  in  [30,32]  are  extended  to  solving  (1.7)  in  [31,33].  Algorithms  for 
solving  (1.8)  also  include  the  column- wise  coordinate  descent  method  [44]  and  the  alternating  least  square 
method  [21].  More  about  NTF  algorithms  can  be  found  in  [75]. 
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1.3.  Organization.  The  rest  of  the  paper  is  organized  as  follows.  Sec.  2  studies  the  convergence  of 
Alg.  1.  In  Sec.  3,  Alg.  1  is  applied  to  both  the  nonnegative  matrix/tensor  factorization  problem  and  the 
completion  problem.  The  numerical  results  are  presented  in  Sec.  4.  Finally,  Sec.  5  concludes  the  paper. 


2.  Convergence  analysis.  In  this  section,  we  analyze  the  convergence  of  Alg.  1  under  the  following 
assumptions. 

Assumption  1.  F  is  continuous  in  dom(F)  and  infxedom(F)  F(x)  >  —oo.  Problem  (1.1)  has  a  Nash 
point  (see  below  for  definition). 

Assumption  2.  Each  block  i  is  updated  by  the  same  scheme  among  (1.2a)-(1.2c)  for  all  k.  Let  l\,l2 
and  I3  denote  the  set  of  blocks  updated  by  (1.2a),  (1.2b)  and  (1.2c),  respectively.  In  addition,  there  exist 
constants  0  <  I  <  L  <  00  such  that 

1.  for  i  el\,  f?  is  strongly  convex  with  modulus  t  <  A*-1  <  L,  namely, 

fi(u)  ~  fi(v)  —  (V/^v),  u  —  v)  +  — y— ||u  —  v||2,  for  all  u,  veX?-,  (2.1) 


2.  for  i  €  I2,  parameters  L1)  1  obey  i  <  1  <  L; 

3.  for  i  el 3,  V/f  is  Lipschitz  continuous  and  parameters  L 

f?  (*?)  <  f?  e*?-1)  +  (g  w  -  i?-1)  + 


1  obey  £  <  L\~x 


<  L  and 


1 1 1 2 


(2.2) 


The  inequality  in  (2.2)  guarantees  sufficient  decrease  of  the  objective.  Taking  L as  the  Lipschitz 
constant  of  V/jfe  can  satisfy  (2.2).  However,  we  allow  smaller  L^~x,  which  can  speed  up  the  algorithm. 

For  our  analysis  below,  we  need  the  Nash  equilibrium  condition  of  (1.1):  for  *  =  !,•••  ,  s, 


F(x  1,  •  •  ■  ,Xi_i,Xi 


U+i,  ■ 


5)  ^  A(xi ,  •  •  •  ,  ,  x^ ,  x^-j-i ,  ■  *  ,  xs) ,  Vxi  e  Xi , 


(2.3) 


or  equivalently 

(VXi/(x)  +  p,,Xi  —  Xj)  >  0,  for  all  x*  e  X, ;  and  for  some  p;  £  9r,(xi),  (2.4) 

where  X,  =  Ai(xi,  •  •  •  ,  Xj_i,Xj+i,  •  •  •  ,xs)  and  9r(xj)  is  the  limiting  subdifferential  (e.g.,  see  [60])  of  r  at  x,:. 
We  call  x  a  Nash  point.  Let  N  be  the  set  of  all  Nash  points,  which  we  assume  to  be  nonempty.  As  shown 
in  [4],  we  have 

5F(x)  =  {VXl/(x)  +  <9n(xi)}  x  •  •  •  x  {VXs/(x)  +  9rs(xs)}. 

Therefore,  if  X  =  Rra,  (2.4)  reduces  to  the  first-order  optimality  condition  0  e  dF(x),  and  x  is  a  critical 
point  (or  stationary  point )  of  (1.1). 

2.1.  Preliminary  result.  The  analysis  in  this  subsection  follows  the  following  steps.  First,  we  show 
sufficient  descent  at  each  step  (inequality  (2.7)  below),  from  which  we  establish  the  square  summable  result 
(Lemma  2.2  below).  Next,  the  square  summable  result  is  exploited  to  show  that  any  limit  point  is  a  Nash 
point  in  Theorem  2.3  below.  Finally,  with  the  additional  assumptions  of  isolated  Nash  points  and  bounded 
{xfe},  global  convergence  is  obtained  in  Corollary  2.4  below.  The  first  step  is  essential  while  the  last  two 
steps  use  rather  standard  arguments.  We  begin  with  the  following  lemma  similar  to  Lemma  2.3  of  [8]. 

Lemma  2.1.  Let  £i(u)  and  ^2(u)  be  two  convex  functions  defined  on  the  convex  set  IA  and  £i(u)  be 
differentiable.  Let  £(u)  =  £i(u)  +  £2(11)  and  u*  =  argmin(V£i(v),  u  —  v)  +  A||u  —  v||2  +  £2(11).  If 

£i(u*)  <  £i(v)  +  (V£i(v),pL(v)  -  v)  +  ~ ||u*  -  v||2, 
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(2.5) 


then  we  have 


£(u)  —  £(u*)  >  —  |ju*  —  vl|2  +  T(v  —  u, u*  —  v)  for  any  u  £  U. 


(2.6) 


Based  on  this  lemma,  we  can  show  our  key  lemma  below. 

Lemma  2.2  (Square  summable  ||xfc  —  1| ) .  Under  Assumptions  1  and  2,  let  {xfc}  be  the  sequence 

k-l  I  Lk~2 

generated  by  Alg.  1  with  0  <  u>*  <  6ut  \,_1  for  5U  <  1  uniformly  over  all  i  £  I3  and  k.  Then 

EfcLollxfe^xfc+1ll2  <°°- 

Proof.  For  i  £  I3,  we  have  the  inequality  (2.2)  and  use  Lemma  2.1  by  letting  F, f  =  f f  +  ri  and  taking 
£1  =  fli' , £2  =  rj,v  =  x^_1  and  u  =  xf_1  in  (2.6)  to  have 


rfc-l 


^(X.  )  ^  iT(Xi  )  >  -A 


II*?-1  -  x?|2  +  -  x^x?  -  x*"1) 


r  k~  1 


-fc-1 


llx?-1  -  x?||2  - 


-(a;lfc-1)2||xf-2-xfc-1"2 


(2.7) 


> 


Lfl,,'fc-i  ”fc"2  L<  < Ilx?-2 -x?"1!!2. 


X,,  -x,  - 


For  i  £  U  J2  we  have  Ff(xf~L)  -  Ff  (x£)  >  2 

(regard  wf  =  0  for  *  £  2i  U  X2)-  Therefore, 


—  xf  ||2,  and  thus  the  inequality  (2.7)  still  holds 


F(xk~1)  -  F(xk)  =  £?=1  (KHx-t1)  ~  KH^)) 

>  eu( 


r  fe  —  1  r  fe  —  2  e2  ,  „  11 

^llx"-1  -  x?|l2  -  ^i-llx?-2  -  x"-1"2 


)• 


Summing  the  above  inequality  over  /c  from  1  to  K,  we  have 


K 


F{x°)-F(xk)>Y.H 


r  k—1 


-  k—2 


c2  \\„k-2  1 1| 2 


fc=l 2=1 
K 


^-||Xi_1  -  x^ll2  -  -L-«5^ ||x?-a  -  XV 


>  E  E  (1~i“)Lr‘ii»L1  -  *?  ii2  >  E  Ppiix*->  ^  Af . 


Al=l  2=1 


k= 1 


Since  F  is  lower  bounded,  taking  Ji  — >  oc  completes  the  proof.  □ 

Now,  we  can  establish  the  following  preliminary  convergence  result. 

Theorem  2.3  (Limit  point  is  Nash  point).  Under  the  assumptions  of  Lemma  2.2,  any  limit  point  of 
{xfc}  is  a  Nash  point,  namely,  satisfying  the  Nash  equilibrium  condition  (2.4). 

Proof.  Let  x  be  a  limit  point  of  {xfc}  and  {xfcj}  be  the  subsequence  converging  to  x.  Since  { L k}  is 
bounded,  passing  another  sequence  if  necessary,  we  have  Li  3  — >  Lj  for  i  =  1,  •  •  •  ,  s  as  j  — >  oo.  Lemma  2.2 
implies  that  ||xfc+1  —  xfe  ||  — >  0,  so  {xfe^+1}  also  converges  to  x. 

For  i  £  Ji,  we  have  F?J+1(x?+1)  <  F?3+1(x!),  Vx,;  £  X*3+1.  Letting  j  — ►  oo,  we  have  (2.3)  from  the 
continuity  of  F  and  the  closedness  of  X .  Similarly,  for  i  £  I2,  we  have 

F(xlf  ■  •  •  ,Xi-l,Xi,Xi+l,"  •  ,XS)  <  F(X!,--  •  ,Xi_1,Xi,Xi+1,  ■  •  •  ,XS)  +  y||Xi  -  Xj||2,  Vx,  £  Xi, 
namely, 

x»  =  argminF(xi,  •  •  •  ,  5c*_i,  xi;  xi+i,  •  •  •  ,xs)  +  -A -||x£  -  Xj||2.  (2.8) 

xieAi  ^ 
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Thus,  Xi  satisfies  the  first-order  optimality  condition  of  (2.8),  which  is  precisely  (2.4).  For  i  EX 3,  we  have 


fc7-+i 


=  argmin  (V/^+^x^Xi  -  x/3)  +  ^Mx*  -  xf  f  +  r^xf). 

xi  ex,kj+1 


L- 


'ki  1. 2 


fc  .  _(_1 

The  convex  proximal  minimization  is  continuous  in  the  sense  that  the  output  x?  J  depends  continuously 
on  the  input  x/J  [59].  Letting  j  — >  oo,  from  x^J+1  — >  x,:  and  xy1  — >  x»,  we  get 


Xi  =  argmin  (VXi/(x),xi  -  Xj) 


F 

2 


II 


Xi  -  Xi  ||2 


+  J'i(Xi). 


(2.9) 


Hence,  x$  satisfies  the  first-order  optimality  condition  of  (2.9),  which  is  precisely  (2.4).  This  completes  the 
proof.  □ 

COROLLARY  2.4  (Global  convergence  given  isolated  Nash  points).  Under  the  assumptions  of  Lemma 
2.2,  we  have  dist(xfc,A f)  — >  0,  if  {xfc}  is  bounded.  Furthermore,  if  Af  contains  uniformly  isolated  points, 
namely,  there  is  r/  >  0  such  that  ||x  —  y||  >  p  for  any  distinct  points  x,y  £  A f,  then  {xfc}  converges  to  a 
point  in  A f. 

Proof.  Suppose  dist(xfc,A/")  does  not  converge  to  0.  Then  there  exists  e  >  0  and  a  subsequence  {xkj} 
such  that  dist(xfc3,  Af)  >  £  for  all  j.  However,  the  boundedness  of  {xfcj}  implies  that  it  must  have  a  limit 
point  x£  A f  according  to  Theorem  2.3,  which  is  a  contradiction. 

From  dist (xk,Af)  — >  0,  it  follows  that  there  is  an  integer  K\  >  0  such  that  xfc  £  Uy&jq-B(y, 1)  for  all 
k  >  K i,  where  B( y,  =  {x  G  X  :  ||x  —  y||  <  ([}.  In  addition,  Lemma  2.2  implies  that  there  exists  another 
integer  K2  >  0  such  that  ||xfc  —  xfc+1||  <  |  for  all  k  >  K^.  Take  K  =  max(A'i,  K2)  and  assume  xA  £  B(x,  |) 
for  some  x  £  Af.  We  claim  that  for  any  x.  ^  y  £  Af,  ||xfc  —  y||  >  3  holds  for  all  k  >  K.  This  claim  can  be 
shown  by  induction  on  k  >  K.  If  some  xk  £  B(x,  |),  then  ||xfc+1  —  x||  <  ||xfc+1  —  xfc||  +  ||xfc  —  x||  <  and 

||xfc+1  -  y ||  >  ||x  -  y ||  -  ||xfc+1  -  x||  >  for  any  x  ±  y  e  Af. 

Therefore,  xk  £  B(x,  (()  for  all  k  >  I\  since  xk  £  Uy&^/B(y,  ^),  and  thus  {xfc}  has  the  unique  limit  point  x, 
which  means  xk  — >  x.  □ 

Remark  2.1.  The  boundedness  of  {xk}  is  guaranteed  if  the  level  set  {x  £  X  :  F(x)  <  F(x0)}  is  bounded. 
However,  the  isolation  assumption  does  not  hold,  or  holds  but  is  difficult  to  verify,  for  many  functions.  This 
motivates  another  approach  below  for  global  convergence. 

2.2.  Kurdyka-Lojasiewicz  inequality.  Before  proceeding  with  our  analysis,  let  us  briefly  review  the 
Kurdyka-Lojasiewicz  inequality,  which  is  central  to  the  global  convergence  analysis  in  the  next  subsection. 

Definition  2.5.  A  function  ip(x)  satisfies  the  Kurdyka-Lojasiewicz  (KL)  property  at  point  x.  £  dom (dif) 
if  there  exists  0  £  [0, 1)  such  that 

IVKx)  -  y>(x)ie 

dist(0,dV’(x))  1  ■  ’ 

is  bounded  around  x  under  the  notational  conventions:  0°  =  1, 00/00  =  0/0  =  0.  In  other  words,  in  a  certain 
neighborhood  U  of  x,  there  exists  <f>{s )  =  cs1^6  for  some  c  >  0  and  d  £  [0, 1)  such  that  the  KL  inequality 
holds 


(f'Qiffx)  —  ,0(x)|)dist(O,  ch/>(x))  >  1,  for  any  x  £  U  fl  dom(ch/>)  and  if(x)  ^  i/j(x),  (2.11) 

where  dom(9V’)  —  {x  :  dif(x)  ^  0}  and  dist(0, dif(x))  =  min{||y||  :  y  €  ch/>(x)}. 
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This  property  was  introduced  by  Lojasiewicz  [46]  on  real  analytic  functions,  for  which  the  term  with 
6  £  [|,1)  in  (2.10)  is  bounded  around  any  critical  point  x.  Kurdyka  extended  this  property  to  functions 
on  the  o-minimal  structure  in  [14,36].  Recently,  the  KL  inequality  was  extended  to  nonsmooth  sub-analytic 
functions  [13].  Since  it  is  not  trivial  to  check  the  conditions  in  the  definition,  we  give  some  examples  below 
that  satisfy  the  KL  inequality. 

Real  analytic  functions.  A  smooth  function  tp(t)  on  R  is  analytic  if  is  bounded  for  all  k 

and  on  any  compact  set  DcB.  One  can  verify  whether  a  real  function  tHx)  011  is  analytic  by  checking 
the  analyticity  of  ip(t)  =  ip{x  +  ty)  for  any  x,  y  £  R".  For  example,  any  polynomial  function  is  real  analytic 
such  as  ||Ax  —  b||2  and  the  objectives  of  (1.7)  and  (1.8).  In  addition,  it  is  not  difficult  to  verify  that  the 
non-convex  function  Lq(x,e,  A)  =  J2i=i(xi  +  e2)?^2  +  ;p|Ax  —  b||2  with  0  <  q  <  1  considered  in  [37]  for 
sparse  vector  recovery  is  a  real  analytic  function  (the  first  term  is  the  e-smoothed  £q  semi-norm) .  The  logistic 
loss  function  ip{t)  =  log(l  +  e_t)  is  also  analytic.  Therefore,  all  the  above  functions  satisfy  the  KL  inequality 
with  6  £  [|,  1)  in  (2.10). 

Locally  strongly  convex  functions.  A  function  ip(x)  is  strongly  convex  in  a  neighborhood  V  with 
constant  /r,  if 

ip(y)  >  ^(x)  +  (7(x),y  —  x)  +  ^||x  —  y||2,  for  all  y(x)  £  dip(x)  and  for  any  x,y  £  T>. 

According  to  the  definition  and  using  the  Cauchy-Schwarz  inequality,  we  have 

V'(y)  ^  tHx)  >  (7(x),y  -  x)  +  |llx-  y II2  >  -^Il7(x)l|2.  for  all  y(x)  e  dip(x). 

Hence,  p,{ip{x)  —  ip( y))  <  dist(0,  dip(x)),  and  ip  satisfies  the  KL  inequality  (2.11)  at  any  point  y  £  V  with 
(p(s)  =  p\fs  and  U  =  V  U  {x  :  ip{x)  >  ^>(y)}.  For  example,  the  logistic  loss  function  ip(t)  =  log(l  +  e_t)  is 
strongly  convex  in  any  bounded  set  T>. 

Semi-algebraic  functions.  A  set  V  C  M"  is  called  semi- algebraic  [11]  if  it  can  be  represented  as 

S  t 

X>  =  (J  Pi  {x  e  R"  :  Pij{x)  =  0,  ql3(x)  >  0}, 

*=1.7  =  1 

where  are  real  polynomial  functions  for  1  <  i  <  s,  1  <  j  <  t.  A  function  ip  is  called  semi-algebraic  if 

its  graph  Gr(^)  =  {(x,  ^>(x))  :  x  £  dom(i)i)}  is  a  semi- algebraic  set. 

Semi-algebraic  functions  are  sub-analytic,  so  they  satisfy  the  KL  inequality  according  to  [13,14].  We 
list  some  known  elementary  properties  of  semi-algebraic  sets  and  functions  below  as  they  help  identify  semi- 
algebraic  functions. 

1.  If  a  set  V  is  semi-algebraic,  so  is  its  closure  c\(T>). 

2.  If  Di  and  V2  are  both  semi-algebraic,  so  are  V i  UX>2,  T^i  nX>2  and  R"\2?i. 

3.  Indicator  functions  of  semi-algebraic  sets  are  semi-algebraic. 

4.  Finite  sums  and  products  of  semi-algebraic  functions  are  semi-algebraic. 

5.  The  composition  of  semi-algebraic  functions  is  semi-algebraic. 

From  items  1  and  2,  any  polyhedral  set  is  semi-algebraic  such  as  the  nonnegative  orthant  R"  =  {x  £  R11  : 
Xi  >  0,V*}.  Hence,  the  indicator  function  <5r"  is  a  semi-algebraic  function.  The  absolute  value  function 
pit)  =  |t|  is  also  semi-algebraic  since  its  graph  is  cl(2?),  where 


T>  =  { (t,  s)  :  t  +  s  =  0,  —  t  >  0}  U  {(t,  s)  :  t  —  s  =  0,  t  >  0}. 


Hence,  the  t'l-norm  ||x||i  is  semi-algebraic  since  it  is  the  finite  sum  of  absolute  functions.  In  addition,  the 
sup-norm  ||x||oo  is  semi-algebraic,  which  can  be  shown  by  observing 

GraphdlxHoo)  =  {(x,t)  :  t  =  max|a;j|}  =  (J{(x,f)  :  \xi\  =  t,  \xj\  <  t,Vj  ±  i}. 

i 

Further,  the  Euclidean  norm  ||x||  is  shown  to  be  semi-algebraic  in  [11].  According  to  item  5,  ||Ax  —  b||i, 

1 1  Ax  —  bHoo  and  ||Ax  —  b||  are  all  semi-algebraic  functions. 

2.3.  Global  convergence  and  rate.  If  {xfc}  is  bounded,  then  Theorem  2.3  guarantees  that  there 
exists  one  subsequence  converging  to  a  Nash  point  of  (1.1).  In  this  subsection,  we  assume  X  =  Rra  and 
strengthen  this  result  for  problems  with  F  obeying  the  KL  inequality.  Our  analysis  here  was  motivated 
by  [4],  which  applies  the  inequality  to  establish  the  global  convergence  of  the  alternating  proximal  point 
method  —  the  special  case  of  BCD  with  two  blocks  and  using  update  (1.2b). 

We  make  the  following  modification  to  Alg.  1.  From  the  proof  of  Lemma  2.2,  we  can  see  that  this 
modification  makes  F(x.k)  strictly  decreasing. 

(Ml).  Whenever  F(xk)  >  F(xfc_1),  we  re-do  the  fcth  iteration  with  x*_1  =  x^1  (i.e. ,  no  extrapolation)  for 
all  i  €  X3. 

In  the  sequel,  we  use  the  notion  =  F(xk)  and  F  =  F(x).  First,  we  establish  the  following  pre¬ 
convergence  result,  the  proof  of  which  is  given  in  the  Appendix. 

Lemma  2.6.  Under  Assumptions  1  and  2,  let  {xfc}  be  the  sequence  of  Alg.  1  with  (Ml).  Let  Ik  = 
minigx3  Lk ,  and  choose  Lk  >  and  tuk  <  S ^ yj <  1,  for  aM  *  £  X3  and  k.  Assume  that  V/  is 
Lipschitz  continuous  on  any  bounded  set  and  F  satisfies  the  KL  inequality  (2.11)  at  x.  If  x o  is  sufficiently 
close  to  x  and  Fk  >  F  for  k  >  0,  then  for  some  B  CW  D  dom(9V’)  with  if  =  F  in  (2.11),  {xk}  C  B  and  xk 
converges  to  a  point  in  B. 

Remark  2.2.  In  the  lemma,  the  required  closeness  o/x°  to  x  depends  on  ll,cf  and  if  =  F  in  (2.11)  (see 
the  inequality  in  (A.l)/  The  extrapolation  weight  tvk  must  be  smaller  than  it  is  in  Lemma  2.2  in  order  to 
guarantee  sufficient  decrease  at  each  iteration. 

The  following  corollary  is  a  straightforward  application  of  Lemma  2.6. 

Corollary  2.7.  Under  the  assumptions  of  Lemma  2.6,  {xfe}  converges  to  a  global  minimizer  of  (1.1) 
if  the  initial  point  x°  is  sufficiently  close  to  any  global  minimizer  x. 

Proof.  Suppose  F(xk°)  =  F(x)  at  some  k o.  Then  xfc  =  xfe°,  for  all  k  >  ko ,  according  to  the  update  rules 
of  xk.  Now  consider  F(xk)  >  F(x)  for  all  k  >  0,  and  thus  Lemma  2.6  implies  that  xk  converges  to  some 
critical  point  x*  if  x°  is  sufficiently  close  to  x,  where  x°,x*,x  £  B.  If  F(x*)  >  F(x),  then  the  KL  inequality 
(2.11)  indicates  cf'  (F(x*)  —  F(x))  dist  (0,  dF(x*))  >  1,  which  is  impossible  since  0  £  dF(x*).  □ 

Next,  we  give  the  convergence  result  of  Alg.  1. 

Theorem  2.8  (Global  convergence).  Under  the  assumptions  of  Lemma  2.6  and  that  from  any  starting 
point  {xfe}  has  a  finite  limit  point  x  ivhere  F  satisfies  the  KL  inequality  (2.11),  the  sequence  {xfc}  converges 
to  x,  which  is  a  critical  point  of  (1.1). 

Proof.  Note  that  F(xk)  is  monotonically  nonincreasing  and  converges  to  F(5c).  If  F(xk°)  =  F(x)  at 
some  fco,  then  xk  =  xk°  =  x  for  all  k  >  k0  from  the  update  rules  of  xk.  It  remains  to  consider  F(xk)  >  F(x) 
for  all  k  >  0.  Since  x  is  a  limit  point  and  F(xk)  — >  X(x),  there  must  exist  an  integer  fco  such  that  xfe°  is 
sufficiently  close  to  x  as  required  in  Lemma  2.6  (see  the  inequality  in  (A.l)).  The  conclusion  now  directly 
follows  from  Lemma  2.6.  □ 

We  can  also  estimate  the  rate  of  convergence,  and  the  proof  is  given  in  the  Appendix. 
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Theorem  2.9  (Convergence  rate).  Assume  the  assumptions  of  Lemma  2.6,  and  suppose  that  con¬ 
verges  to  a  critical  point  x,  at  which  F  satisfies  the  KL  inequality  with  tf(s)  =  cs 1-0  for  c  >  0  and  9  £  [0, 1). 
We  have: 


1.  If  9  =  0,  xfc  converges  to  x  in  finite  iterations; 

2.  If  9  £  (0,  |],  ||xfc  —  x||  <  Crk ,  Vfc  >  ko,  for  certain  ko  >  0,  C  >  0,  r  £  [0, 1); 

3.  If  9  £  (|,  1),  ||xfc  —  x||  <  Ck~^1~6^<'2e~1') ,  Vfc  >  ko,  for  certain  ko  >0,  C  >  0. 


3.  Factorization  and  completion  of  nonnegative  matrices  and  tensors.  In  this  section,  we  apply 
Alg.  1  with  modification  (Ml)  to  the  factorization  and  the  completion  of  nonnegative  matrices  and  tensors. 
Since  a  matrix  is  a  two-way  tensor,  we  present  the  algorithm  for  tensors.  We  first  overview  tensor  and  its 
two  popular  factorizations. 

3.1.  Overview  of  tensor.  A  tensor  is  a  multi-dimensional  array.  For  example,  a  vector  is  a  first-order 
tensor ,  and  a  matrix  is  a  second-order  tensor.  The  order  of  a  tensor  is  the  number  of  dimensions,  also  called 
way  or  mode.  For  an  N- way  tensor  X  £  R/|  x,2X'"x/n!  we  let  its  («i,*2,  •  •  •  ,*Ar)th  element  be  denoted  by 
Xi1i2.-iN.  Below  we  list  some  concepts  related  to  tensor.  For  more  details  about  tensor,  the  reader  is  referred 
to  the  review  paper  [35]. 

1.  fiber:  a  fiber  of  a  tensor  AT  is  a  vector  obtained  by  fixing  all  indices  of  X  except  one.  For  example, 
a  row  of  a  matrix  is  a  mode-2  fiber  (the  1st  index  is  fixed),  and  a  column  is  a  mode-1  fiber  (the  2nd 
index  is  fixed).  We  use  Xi1...in_1]iri+1...iN  to  denote  a  mode-n  fiber  of  an  ATh-order  tensor  X. 

2.  slice:  a  slice  of  a  tensor  AT  is  a  matrix  obtained  by  fixing  all  indices  of  X  except  two.  Take  a 
third-order  tensor  X  for  example.  X,::.  X:J:.  and  X::fc  denote  horizontal,  lateral,  and  frontal  slices 
of  AT,  respectively. 

3.  matricization:  the  mode-n  matricization  of  a  tensor  AT  is  a  matrix  whose  columns  are  the  mode-n 
fibers  of  X  in  the  lexicographical  order.  We  let  X(„)  denote  the  mode-n.  matricization  of  X. 

4.  tensor-matrix  product:  the  mode-n  product  of  a  tensor  X  £  RIlXl2 x'"x/at  with  a  matrix  A  £ 
Rjxln  js  a  tensor  of  size  I\  x  ■  •  •  J„_i  x  J  x  In+ 1  x  •  •  ■  x  7jv  defined  as 


(3.1) 


In  addition,  we  briefly  review  the  matrix  Kronecker,  Khatri-Rao  and  Hadamard  products  below,  which 
we  use  to  derive  tensor-related  computations. 

The  Kronecker  product  of  matrices  A  £  Rmxn  and  B  £  Rpxq  is  an  mp  x  nq  matrix  defined  by  A  ® 
B  =  [aij^mpxnq-  The  Khatri-Rao  product  of  matrices  A  £  Rmxq  and  B  £  Rpxq  is  an  mp  x  q  matrix: 
A  ©  B  =  [ai  (gi  bi,  a2  ®  b2,  •  •  •  ,  &q  (g>  bg] ,  where  a ,  ,  b,  are  the  ith  columns  of  A  and  B,  respectively.  The 
Hadamard  product  of  matrices  A,  B  £  Rmxrl  is  the  componentwise  product  defined  by  A  *  B 

Two  important  tensor  decompositions  are  the  CANDECOMP/PARAFAC  (CP)  [29]  and  Tucker  [68] 
decompositions.  The  former  one  decomposes  a  tensor  X  £  R/iX/2x---x/jv  jn  ^he  form  0f  x  =  A10A20  •  -oAjv, 
where  A„  £  RInXr,n  =!,•••  , N  are  factor  matrices,  r  is  the  tensor  rank  of  AT,  and  the  outer  product  “o” 


is  defined  as 


r 


where  a!]1-1  is  the  (i,  j)th  element  of  An  and  [I]  =  {1,2,  •  •  •  ,  I}.  The  latter  Tucker  decomposition  decomposes 


a  tensor  X  in  the  form  of  X  =  Q  xt  Ai  X2  A2  •  •  •  Xjv  Ajv,  where  Q  £  RjixJ2x--xJn  caneci  the  core  tensor 
and  An  £  K/nX  Jn,  n  =  1,  •  ■  ■  ,  N  are  factor  matrices. 
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3.2.  An  algorithm  for  nonnegative  tensor  factorization.  One  can  obtain  a  nonnegative  CP  de¬ 
composition  of  a  nonnegative  tensor  tensor  Ai  €  M.IiX'"xIn  by  solving 


1, 


min  - ||A4  —  Ai  o  A2  o  •  •  •  o  Ajv||p.,  subject  to  A„  €  n  =  1,  • 


,N 


(3.2) 


where  r  is  a  specified  order  and  the  Frobenius  norm  of  a  tensor  X  G 


iX  —  xIn 


x  . 

,lN  ■■•In’ 


min  -||A1  —  Q  xx  A:  x2  A2  -  ■  ■  xN  Ajvllp-,  subject  to  Q  G  K^lX'"x  N  ,An  S  R+*x  ",Vn, 


(3.3) 


is  defined  as  ||A||f  = 

.  Similar  models  based  on  the  CP  decomposition  can  be  found  in  [19,31,33].  One  can 
obtain  a  nonnegative  Tucker  decomposition  of  Ai  by  solving 

^ \\.\A  —  G  x  i  A,  Xo  Ao  . . .  x  ^  A  ml'2 

as  in  [34,44,51].  Usually,  it  is  computationally  expensive  to  update  Q.  Since  applying  Alg.  1  to  problem 
(3.3)  involves  lots  of  computing  details,  we  focus  on  applying  Alg.  1  with  update  (1.2c)  to  problem  (3.2). 
Let  A  =  (Ai,  •  •  •  ,Aj\r)  and 

F( A)  =  F(A1,  A2,  •  •  •  ,  Ajv)  =  h\M  -  Ax  o  A2  o  •  •  •  o  AN\\2F 

be  the  objective  of  (3.2).  Consider  updating  A„  at  iteration  k.  Using  the  fact  that  if  X  =  Aio A20-  •  -  oA jv, 
then  X(n)  =  An  (Ajy  0  •  •  ■  An+i  ©  A„_i  •  •  ■  Ai)T,  we  have 


F{ A)  =  -  M(„)  -  An  (An  ©  ■ 


Ln+1  ©  An_i  ■  ■  •  Aj 


and 


Let 


Va nF  —  ^A„  (Ajv  ©  •  •  ■  An+\  ©  A„_i  •  •  ■  Ai)t  —  (An  ©  •  ■  ■  An+i  ©  A„_i  •  •  ■  Ai) . 


Bfe_1  =  A^_1  ^  a/=-i^a  k 


-jv  ®'"K+i  ©A^-.-Af.  (3.4) 

We  take  L^-1  =  max(£fe_2,  ||(Bfc-1)TBfc-1||),  where  £k~2  =  min „Lj-2  and  ||A||  is  the  spectral  norm  of  A. 
Let 


,k- 1 


=  min  ©fc_  i,<L 


1  fk-2 

7F3T 


(3.5) 


where  5U  <  1  is  pre-selected  and  u>k- i  =  tk  fk  1  with  to  =  1  and  tk  =  \  ^1  +  yj  1  +  In  addition, 


let 


A^-1  =  At1  +  cot1  (At1  ~  Akn~2),  and  G*"1  =  (A^B^T  -  M(n) 


derive  the  update  (1.2c): 


B 


k- 1 


be  the  gradient.  Then  we 


A„  =  argmin  (  G„  ,An- A 

A„>0 


k- 1 


Tk— 1  2 

a  —  Afc_1 


which  can  be  written  in  the  closed  form 


Al  =  max 


(3.6) 


At  the  end  of  iteration  k,  we  check  whether  F  (Ak)  >  F  ( Ak  -1).  If  so,  we  re-update  Ak  by  (3.6)  with 
A^_1  =  Ak_1,  for  n  =  1,  •  ■  •  ,  N. 

Remark  3.1.  In  (3.6),  G(j-1  is  most  expensive  to  compute.  To  efficiently  compute  it,  we  write  G([-1  = 
Ak_1  (Bfe_1)TBfc_1  -  M(ri)Bfe_1.  Using  (A  ©  B)T(A  ©  B)  =  (ATA)  *  (BTB),  we  compute  (Bfe_1)TBfe_1 
by 

(Bfe_1)TBfc_1  =  ((a*)ta*)  * ...  *  ((At  1)ta^_1)  *  ((A^;11)TA^11)  *  ...  *  ((A^"1)tA^_1)  . 

Then,  M(„)Bfc_1  can  be  obtained  by  the  matricized-tensor-times-Khatri-Rao-product  [6]. 

Alg.  2  summarizes  how  to  apply  Alg.  1  with  update  (1.2c)  to  problem  (3.2). 
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Algorithm  2  Alternating  proximal  gradient  method  for  solving  (3.2) 

Input:  nonnegative  N- way  tensor  At  and  rank  r. 

Output:  nonnegative  factors  Ai,  ■  ■  ■  ,  Ajv. 

Initialization:  choose  a  positive  number  5^  <  1  and  randomize  A”1  =  A°  ,  n  =  1,  •  •  •  ,  N,  as  nonnegative  matrices 
of  appropriate  sizes, 
for  k  =  1,  2,  •  •  •  do 

for  n  =  1,  2,  •  •  •  ,  N  do 

Compute  and  set  ui^T1  according  to  (3.5); 

Let  At1  =  A*-1  +  ^(AjT1  -  A(T2); 

Update  Ak  according  to  (3.6). 

end  for 

if  F  (Afe)  >  F  (Afe_1)  then 

Re-update  A„  according  to  (3.6)  with  A*-1  =  A);-1,  n  =  1,  ■  •  ■  ,  N 

end  if 

if  stopping  criterion  is  satisfied  then 
Return  A*,  •  •  •  ,A%. 

end  if 
end  for 


3.3.  Convergence  results.  Since  problem  (3.2)  is  a  special  case  of  problem  (1.1),  the  convergence 
results  in  Sec.  2  apply  to  Alg.  2.  Let  Vn  =  K(h"xr  and  $£>„(•)  be  the  indicator  function  on  T>n  for 
n=  1,  •  ■  •  ,1V.  Then  (3.2)  is  equivalent  to 


N 

min  Q{ A)  =  F( A)  +  V  foB( A„).  (3.7) 

Ai,— ,Ajv  ' 

n=l 

According  to  the  discussion  in  Sec.  2.2,  Q  is  a  semi-algebraic  function  and  satisfies  the  KL  property 
(2.10)  at  any  feasible  point.  Further,  we  get  9  ^  0  in  (2.10)  for  Q  at  any  critical  point.  By  writing 
the  first-order  optimality  conditions  of  (3.7),  one  can  find  that  if  (Ai,  •  •  •  ,  A,y)  is  a  critical  point,  then  so 
is  (tAi,  jA2,  A3,  •  •  •  ,  Ajv)  for  any  t  >  0.  Therefore,  from  Theorems  2.8  and  2.9  and  the  above  discussions, 
we  have 

Theorem  3.1.  Let  {Afc}  be  the  sequence  generated  by  Alg.  2.  Assume  0<£<£k<L^<L<oo  for 
all  n  and  k.  Then  {AA }  converges  to  a  critical  point  A.  and  the  asymptotic  convergence  rates  in  parts  2  and 
3  of  Theorem  2.9  apply. 

Remark  3.2.  A  simple  way  to  satisfy  the  upper  boundedness  condition  of  L is  to  scale  (Ai,  •  •  •  ,  Ajv) 
so  that  II  Ai||f  =  •  •  ■  =  ||Ajv||f  after  each  iteration,  and  the  lower  boundedness  condition  can  be  satisfied  if 
the  initial  point  is  zero. 


3.4.  An  algorithm  for  nonnegative  tensor  completion.  Alg.  2  can  be  easily  modified  for  solving 
the  nonnegative  tensor  completion  problem 


min  -||Pn(Al  -  Ai  o  A2  o  •  •  •  o  Ajv) 
Ai,*--  ,A]\r>0  Z 


If, 


(3.8) 


where  f l  C  [/i]  x  [If\  x  •  •  •  [Jjv]  is  the  index  set  of  the  observed  entries  of  AA.  and  Vn(X)  keeps  the  entries  of 
A  in  f l  and  sets  the  remaining  ones  to  zero.  Nonnegative  matrix  completion  (corresponding  to  N  =  2)  has 
been  proposed  in  [73],  where  it  is  demonstrated  that  a  low-rank  and  nonnegative  matrix  can  be  recovered 
from  a  small  set  of  its  entries  by  taking  advantages  of  both  low-rankness  and  nonnegative  factors  .  To  solve 
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(3.8),  we  transform  it  into  the  equivalent  problem 

min  G(A,  Af)  =  ^||Af  -  Ai  o  A2  o  o  Ajvlll-,  subject  to 'Pq(A')  = 'Pq(AI).  (3.9) 

-Y,A„>0,ra=l,---  ,N  l 

Our  algorithm  shall  cycle  through  the  decision  variables  A1;  ■  ■  •  ,  Ajv  and  X .  To  save  space,  we  describe  a 
modification  to  Alg.  2.  At  each  iteration  of  Alg.  2,  we  set  its  A4  to  Xk~1  and,  after  its  updates  (1.2c)  on 
Ai,  •  •  ■  ,  Ajv,  perform  update  (1.2a)  on  X  as 

Xk  =Va{M)+Vac{Aklo...oAkN),  (3.10) 

where  fic  is  the  complement  of  f 2.  Note  that  for  a  fixed  A,  G( A,  X)  is  a  strongly  convex  function  of  X  with 
modulus  1,  Hence,  the  convergence  result  for  Alg.  2  still  holds  for  this  algorithm  with  extra  update  (3.10). 

4.  Numerical  results.  In  this  section,  we  test  Alg.  2  for  nonnegative  matrix  and  three-way  tensor 
factorization,  as  well  as  their  completion.  In  our  implementations,  we  simply  choose  6U  =  1.  The  algorithm 
is  terminated  whenever  Fk ,  FJ:+1  <  tol  holds  for  three  iterations  in  a  row  or  M  /U  <  tol  is  met,  where  Fi.  is 
the  objective  value  after  iteration  k  and  tol  is  specified  below.  We  compare 

•  APG-MF:  nonnegative  matrix  factorization  (NMF)  by  Alg.  2  in  Sec.  3.2; 

•  APG-TF:  nonnegative  tensor  factorization  (NTF)  by  Alg.  2  in  Sec.  3.2; 

•  APG-MC:  nonnegative  matrix  completion  (NMC)  by  modified  Alg.  2  in  Sec.  3.4; 

•  APG-TC:  nonnegative  tensor  completion  (NTC)  by  modified  Alg.  2  in  Sec.  3.4. 

All  the  tests  were  performed  on  a  laptop  with  an  i7-620m  CPU  and  3GB  RAM  and  running  32-bit  Windows 
7  and  Matlab  2010b  with  Tensor  Toolbox  of  version  2.5  [7]. 

4.1.  Nonnegative  matrix  factorization.  We  choose  to  compare  the  most  popular  and  recent  algo¬ 
rithms.  The  first  two  compared  ones  are  the  alternating  least  square  method  (Als-MF)  [9,53]  and  multi¬ 
plicative  updating  method  (Mult-MF)  [39],  which  are  available  as  MATLAB’s  function  nnmf  with  specifiers 
als  and  mult,  respectively.  The  recent  ANLS  method  Blockpivot-MF  is  compared  since  it  outperforms  all 
other  compared  ANLS  methods  in  both  speed  and  solution  quality  [32].  Another  compared  algorithm  is 
the  recent  ADM-based  method  ADM-MF  [77].  Although  both  Blockpivot-MF  and  ADM-MF  have  superior 
performance  than  Als-MF  and  Mult-MF,  we  include  them  in  the  first  two  tests  below  due  to  their  popularity. 

We  set  tol  =  10”4  for  all  the  compared  algorithms  except  ADM-MF,  for  which  we  set  tol  =  10-5 
since  it  is  a  dual  algorithm  and  10 _4  is  too  loose.  The  maximum  number  of  iterations  is  set  to  2000  for 
all  the  compared  algorithms.  The  same  random  starting  points  are  used  for  all  the  algorithms  except  for 
Mult-MF.  Since  Mult-MF  is  very  sensitive  to  initial  points,  we  set  the  initial  point  by  running  Mult-MF  10 
iterations  for  5  independent  times  and  choose  the  best  one.  All  the  other  parameters  for  Als-MF,  Mult-MF, 
Blockpivot-MF  and  ADM-MF  are  set  to  their  default  values. 

4.1.1.  Synthetic  data.  Each  matrix  in  this  test  is  exactly  low-rank  and  can  be  written  in  the  form 
of  M  =  LR,  where  L  and  R  are  generated  by  MATLAB  commands  max(0,randn(m,q))  and  rand(q,n), 
respectively.  It  is  worth  mentioning  that  generating  R  by  rand(q,n)  makes  the  problems  more  difficult 
than  max(0,randn(q,n) )  or  abs (randn(q,n) ) .  The  algorithms  are  compared  with  fixed  n  =  1000  and  m 
chosen  from  {200,  500, 1000},  q  from  {10, 20, 30}.  The  parameter  r  is  set  to  q  in  (3.2).  We  use  relative  error 
relerr  =  HA1A2  —  M||i?/||M|| j?  and  CPU  time  (in  seconds)  to  measure  performance.  Table  4.1  lists  the 
average  results  of  20  independent  trials.  From  the  table,  we  can  see  that  APG-MF  outperforms  all  the  other 
algorithms  in  both  CPU  time  and  solution  quality. 

4.1.2.  Image  data.  In  this  subsection,  we  compare  APG-MF  (proposed),  ADM-MF,  Blockpivot-MF, 
Als-MF  and  Mult-MF  on  the  CBCL  and  ORL  image  databases  used  in  [25,42].  There  are  6977  face  images 
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Table  4.1 

Comparison  on  nonnegative  random  m  X  n  matrices  for  n  =  1000,’  bold  are  large  error  or  slow  time. 


APG-MFt  (prop’d) 

ADM-MF 

Blockpivot-MF 

Als-MF 

Mult-MF 

m 

r 

relerr 

time 

relerr 

time 

relerr 

time 

relerr 

time 

relerr 

time 

200 

10 

9.98e-5 

7.16e-l 

2.24e-3 

1.04e+0 

5.36e-4 

1.30e+0 

7.39e-3 

1.04e+0 

3.61e-2 

2.67e+0 

200 

20 

9.97e-5 

2.09e+0 

3.02e-3 

2.80e+0 

1.02e-3 

4.71e+0 

1. Ole-2 

2.33e+0 

4.64e-2 

3.61e+0 

200 

30 

9.97e-5 

4.72e+0 

4.55e-3 

5.70e+0 

1.75e-3 

1.06e+l 

1.04e-2 

4.54e+0 

4.09e-2 

5.53e+0 

500 

10 

9.98e-5 

1.61e+0 

2.26e-3 

2.39e+0 

5.11e-4 

2.38e+0 

1.15e-2 

2.99e+0 

3.58e-2 

7.76e+0 

500 

20 

9.98e-5 

3.66e+0 

2.82e-3 

4.38e+0 

5.53e-4 

6.86e+0 

1.08e-2 

6.31e+0 

4.96e-2 

7.99e+0 

500 

30 

9.98e-5 

7.75e+0 

3.51e-3 

8.34e+0 

5.75e-4 

1.37e+l 

1.29e-2 

9.95e+0 

4.42e-2 

1.20e+l 

1000 

10 

9.98e-5 

2.86e+0 

2.11e-3 

3.44e+0 

4.99e-4 

3.18e+0 

1.54e-3 

8.04e+0 

3.25e-2 

1.55e+l 

1000 

20 

9.98e-5 

7.44e+0 

2.82e-3 

7.19e+0 

5.46e-4 

1.05e+l 

1.74e-2 

1.75e+l 

4.96e-2 

l.Ole+l 

1000 

30 

9.98e-5 

1.27e+l 

3. Ole-3 

1.28e+l 

5.76e-4 

2.00e+l 

1.99e-2 

2.61e+l 

4.57e-2 

2.21e+l 

f:  the  relerr  values  of  APG-MF  are  nearly  the  same  due  to  the  use  of  the  same  stopping  tolerence. 


Table  4.2 

Comparison  on  2000  selected  images  from  the  CBCL  face  database;  bold  are  large  error  or  slow  time. 


APG-MF  (proposed) 

ADM-MF 

Blockpivot-MF 

Als-MF 

Mult-MF 

r 

relerr 

time 

relerr 

time 

relerr 

time 

relerr 

time 

relerr 

time 

30 

1.91e-l 

3.68 

1.92e-l 

7.33 

1.90e-l 

21.5 

3.53e-l 

3.15 

2.13e-l 

6.51 

60 

1.42e-l 

12.5 

1.43e-l 

19.5 

1.40e-l 

63.2 

4.59e-l 

1.80 

1.74e-l 

12.1 

90 

1.13e-l 

26.7 

1.15e-l 

34.2 

1.12e-l 

111 

6.00e-l 

2.15 

1.52e-l 

18.4 

in  the  training  set  of  CBCL,  each  having  19  x  19  pixels.  Multiple  images  of  each  face  are  taken  with  varying 
illuminations  and  facial  expressions.  The  first  2000  images  are  used  for  test.  We  vectorize  every  image 
and  obtain  a  matrix  M  of  size  361  x  2000.  Rank  r  is  chosen  from  {30,60,90}.  The  average  results  of 
10  independent  trials  are  given  in  Table  4.2.  We  can  see  that  APG-MF  outperforms  ADM-MF  in  both 
speed  and  solution  quality.  APG-MF  is  as  accurate  as  Blockpivot-MF  but  runs  much  faster.  Als-MF  and 
Mult-MF  fail  this  test,  and  Als-MF  stagnates  at  solutions  of  low  quality  at  the  very  beginning.  Due  to  the 
poor  performance  of  Als-MF  and  Mult-MF,  we  only  compare  APG-MF,  ADM-MF  and  Blockpivot-MF  in 
the  remaining  tests. 

The  ORL  database  has  400  images  divided  into  40  groups.  Each  image  has  112  x  92  pixels,  and  each 
group  has  10  images  of  one  face  taken  from  10  different  directions  and  with  different  expressions.  All  the 
images  are  used  for  test.  We  vectorize  each  image  and  obtain  a  matrix  M  of  size  10304  x  400.  As  in  last  test, 
we  choose  r  from  {30,60,90}.  The  average  results  of  10  independent  trials  are  listed  in  Table  4.3.  From  the 
results,  we  can  see  again  that  APG-MF  is  better  than  ADM-MF  in  both  speed  and  solution  quality,  and  in 
far  less  time  APG-MF  achieves  comparable  relative  errors  as  Blockpivot-MF. 

4.1.3.  Hyper  spectral  data.  It  has  been  shown  in  [54]  that  NMF  can  be  applied  to  spectral  data 
analysis.  In  [54],  a  regularized  NMF  model  is  also  considered  with  penalty  terms  a||A1|||.  and  /3|| A2 1| 
added  in  the  objective  of  (3.2).  The  parameters  a  and  (3  can  be  tuned  for  specific  purposes  in  practice. 
Here,  we  focus  on  the  original  NMF  model  to  show  the  effectiveness  of  our  algorithm.  However,  our  method 
can  be  easily  modified  for  solving  the  regularized  NMF  model.  In  this  test,  we  use  a  150  x  150  x  163 
hyperspectral  cube  to  test  the  compared  algorithms.  Each  slice  of  the  cube  is  reshaped  as  a  column  vector, 
and  a  22500  x  163  matrix  M  is  obtained.  In  addition,  the  cube  is  scaled  to  have  a  unit  maximum  element. 
Four  selected  slices  before  scaling  are  shown  in  Figure  4.1  corresponding  to  the  1st,  50th,  100th  and  150th 
columns  of  M.  The  dimension  r  is  chosen  from  {20,30,40,50},  and  Table  4.4  lists  the  average  results  of  10 
independent  trials.  We  can  see  from  the  table  that  APG-MF  is  superior  to  ADM-MF  and  Blockpivot-MF 
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Table  4.3 

Comparison  on  the  images  from  the  ORL  face  database;  bold  are  slow  time. 


APG-MF  (proposed) 

ADM-MF 

Blockpivot-MF 

r 

relerr 

time 

relerr 

time 

relerr 

time 

30 

1.67e-l 

15.8 

1.71e-l 

46.5 

1.66e-l 

74.3 

60 

1.41e-l 

42.7 

1.45e-l 

88.0 

1.40e-l 

178 

90 

1.26e-l 

76.4 

1.30e-l 

127 

1.25e-l 

253 

Fig.  4.1.  Hyperspectral  recovery  at  sample  ratio  SR  =  0.20  and  run  time  T  =  50/  four  selected  slices  are  shown 


in  both  speed  and  solution  quality. 

4.1.4.  Nonnegative  matrix  completion.  In  this  subsection,  we  compare  APG-MC  and  the  ADM- 
based  algorithm  (ADM-MC)  proposed  in  [73]  on  the  hyperspectral  data  used  in  last  test.  It  is  demonstrated 
in  [73]  that  ADM-MC  outperforms  other  matrix  completion  solvers  such  as  APGL  and  LMaFit  on  recovering 
nonnegative  matrices  because  ADM-MC  takes  advantages  of  data  nonnegativity  while  the  latter  two  do  not. 
We  fix  the  dimension  r  =  40  in  (3.8)  and  choose  sample  ratio  SR  =  ^  from  {0.20,  0.30,  0.40},  where  the 
samples  in  D  are  chosen  at  random.  The  parameter  <5 u  for  APG-MC  is  set  to  1,  and  all  the  parameters  for 
ADM-MC  are  set  to  their  default  values.  To  make  the  comparison  consistent,  we  let  both  of  the  algorithms 
run  to  a  maximum  time  (sec)  T  =  50, 100,  and  we  employ  peak-signal-to-noise-ratio  (PSNR)  and  mean 
squared  error  (MSE)  to  measure  the  performance  of  the  two  algorithms.  Table  4.5  lists  the  average  results 
of  10  independent  trials.  From  the  table,  we  can  see  that  APG-MC  is  significantly  better  than  ADM-MC  in 
all  cases. 

4.2.  Nonnegative  three-way  tensor  factorization.  To  the  best  of  our  knowledge,  all  the  existing 
algorithms  for  nonnegative  tensor  factorizations  are  extensions  of  those  for  nonnegative  matrix  factorization 
including  multiplicative  updating  method  [71],  hierachical  alternating  least  square  algorithm  [19],  alternaing 
Poisson  regression  algorithm  [17]  and  alternating  nonnegative  least  square  (ANLS)  methods  [31,33].  We 
compare  APG-TF  with  two  ANLS  methods  AS-TF  [31]  and  Blockpivot-TF  [33],  which  are  also  proposed 
based  on  the  CP  decomposition  and  superior  over  many  other  algorithms.  We  set  tol  =  1CU4  and  maxit  = 
2000  for  all  the  compared  algorithms.  All  the  other  parameters  for  Blockpivot-TF  and  AS-TF  are  set  to 
their  default  values. 

4.2.1.  Synthetic  data.  We  compare  APG-TF,  Blockpivot-TF  and  AS-TF  on  randomly  generated 
three-way  tensors.  Each  tensor  is  A4  =  L  o  C  o  R,  where  L,  C  are  generated  by  MATLAB  commands 
max(0,randn(Nl ,q) )  and  max(0,randn(N2,q)),  respectively,  and  R  by  rand(N3,q).  The  algorithms  are 
compared  with  two  sets  of  (N\,N2,  IV3)  and  r  =  q  =  10, 20,  30.  The  relative  error  relerr  =  ||A4  —  Ai  o  A2  o 
A3||f/||A4||f  and  CPU  time  (sec)  measure  the  performance  of  the  algorithms.  The  average  results  of  10 
independent  runs  are  shown  in  Table  4.6,  from  which  we  can  see  that  all  the  algorithms  give  similar  results. 

4.2.2.  Image  test.  NMF  does  not  utilize  the  spatial  redundancy,  and  the  matrix  decomposition  is  not 
unique.  Also,  NMF  factors  tend  to  form  the  invariant  parts  of  all  images  as  ghosts  while  NTF  factors  can 
correctly  resolve  all  the  parts  [63].  We  compare  APG-TF,  Blockpivot-TF  and  AS-TF  on  two  nonnegative 
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Table  4.4 

Comparison  on  hyperspectral  data  of  size  150  X  150  X  163;  bold  are  large  error  or  slow  time. 


APG-MF  (proposed) 

ADM-MF 

Blockpivot-MF 

r 

relerr 

time 

relerr 

time 

relerr 

time 

20 

1.18e-2 

34.2 

2.34e-2 

87.5 

1.38e-2 

62.5 

30 

9.07e-3 

63.2 

2.02e-2 

116 

1.10e-2 

143 

40 

7.56e-3 

86.2 

1.78e-2 

140 

9.59e-3 

194 

50 

6.45e-3 

120 

1.58e-2 

182 

8.00e-3 

277 

Table  4.5 

Comparison  on  a  hyperspectral  data  at  stopping  time  T  =  50, 100  (sec);  bold  are  large  error. 


T  =  50 

APG-MC  (proposed) 

ADM-MC 

T  =  100 

APG-MC  (proposed) 

ADM-MC 

Smpl.  Rate 

PSNR 

MSE 

PSNR 

MSE 

Smpl.  Rate 

PSNR 

MSE 

PSNR 

MSE 

0.20 

32.30 

5.89e-4 

28.72 

1.35e-3 

0.20 

32.57 

5.54e-4 

28.80 

1.33e-3 

0.30 

40.65 

8.62e-5 

33.58 

4.64e-4 

0.30 

41.19 

7.61e-5 

33.69 

4.52e-4 

0.40 

45.77 

2.66e-5 

38.52 

1.46e-4 

0.40 

46.03 

2.50e-5 

38.69 

1.41e-4 

three-way  tensors  in  [63].  Each  slice  of  the  tensors  corresponds  to  an  image.  The  first  tensor  is  19  x  19  x  2000 
and  is  formed  from  2000  images  in  the  CBCL  database,  used  in  Sec.  4.1.2.  The  average  performance  of 
10  independent  runs  with  r  =  40,  50,  60  are  shown  in  Table  4.7.  Another  one  has  the  size  of  32  x  32  x  256 
and  is  formed  with  the  256  images  in  the  Swimmer  dataset  [20].  The  results  of  10  independent  runs  with 
r  =  40,50,60  are  listed  in  Table  4.8.  Both  tests  show  that  APG-TF  is  consistently  faster  than  Blockpivot- 
TF  and  AS-TF.  In  particular,  APG-TF  is  much  faster  than  Blockpivot-TF  and  AS-TF  with  better  solution 
quality  in  the  second  test. 

4.2.3.  Hyperspectral  data.  NTF  is  employed  in  [76]  for  hyperspectral  unmixing.  It  is  demonstrated 
that  the  cubic  data  can  be  highly  compressed  and  NTF  is  efficient  to  identify  the  material  signatures.  We 
compare  APG-TF  with  Blockpivot-TF  and  AS-TF  on  the  150  x  150  x  163  hyperspectral  cube,  which  is  used 
in  Sec.  4.1.3.  For  consistency,  we  let  them  run  to  a  maximum  time  T  and  compare  the  relative  errors.  The 
parameter  r  is  chosen  from  {30, 40, 50, 60}.  The  relative  errors  corresponding  to  T  =  10,  25, 50, 100  are  shown 
in  Table  4.9,  as  the  average  of  10  independent  trials.  We  can  see  from  the  table  that  APG-TF  achieves  the 
same  accuracy  much  earlier  than  Blockpivot-TF  and  AS-TF. 

4.2.4.  Nonnegative  tensor  completion.  Recently,  [45]  proposes  tensor  completion  based  on  mini¬ 
mizing  tensor  n-rank,  the  matrix  rank  of  mode-n  matricization  of  a  tensor.  Using  the  matrix  nuclear  norm 
instead  of  matrix  rank,  they  solve  the  convex  program 

N 

min^an||X(n)||*,  subject  to  Vq{X)  =  Pn(M),  (4.1) 

n=  1 

where  ara’s  are  pre-specified  weights  satisfying  J2n  an  =  1  and  jj  A|j*  is  the  nuclear  norm  of  A  defined  as  the 
sum  of  singular  values  of  A.  Meanwhile,  they  proposed  some  algorithms  to  solve  (4.1)  or  its  relaxed  versions, 
including  simple  low-rank  tensor  completion  (SiLRTC),  fast  low-rank  tensor  completion  (FaLRTC)  and  high 
accuracy  low-rank  tensor  completion  (HaLRTC).  We  compare  APG-TC  with  FaLRTC  on  synthetic  three- 
way  tensors  since  FaLRTC  is  more  efficient  and  stable  than  SiLRTC  and  HaLRTC.  Each  tensor  is  generated 
similarly  as  in  Sec.  4.2.1.  Rank  q  is  chosen  from  {10,20,30}  and  sampling  ratio  SR  =  |S~2|/(iVi IV2 A3)  from 
{0.10, 0.30,  0.50}.  For  APG-TC,  we  use  r  =  q  and  r  =  [1.25gJ  in  (3.8).  We  set  tol  =  10-4  and  maxit  =  2000 
for  both  algorithms.  The  weights  an's  in  (4.1)  are  set  to  an  =  1,  n  =  1,  2, 3,  and  the  smoothing  parameters 
for  FaLRTC  are  set  to  /.in  =  ^“-,71  =  1,2,3.  Other  parameters  of  FaLRTC  are  set  to  their  default  values. 
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Table  4.6 

Comparison  on  synthetic  three-way  tensors;  bold  are  large  error  or  slow  time. 


Problem  Setting 

APG-TF  (proposed) 

AS-TF 

Blockpivot-TF 

Ah 

n2 

n3 

9 

relerr 

time 

relerr 

time 

relerr 

time 

80 

80 

80 

10 

8.76e-005 

4.39e-001 

7.89e-005 

8.64e-001 

8.62e-005 

8.19e-001 

80 

80 

80 

20 

9.47e-005 

1.26e+000 

1.97e-004 

1.45e+000 

1.77e-004 

1.21e+000 

80 

80 

80 

30 

9.65e-005 

2.83e+000 

2.05e-004 

2.13e+000 

2.07e-004 

1.95e+000 

50 

50 

500 

10 

9.15e-005 

1.27e+000 

1.07e-004 

1.91e+000 

9.54e-005 

1.87e+000 

50 

50 

500 

20 

9.44e-005 

3.42e+000 

1.86e-004 

3.17e+000 

1.77e-004 

3.47e+000 

50 

50 

500 

30 

9.74e-005 

7.11e-(-000 

1.89e-004 

5.04e+000 

1.88e-004 

4.54e+000 

Table  4.7 

Comparison  results  on  CBCL  database ;  bold  are  slow  time. 


APG-TF  (proposed) 

AS-TF 

Blockpivot-TF 

r 

relerr 

time 

relerr 

time 

relerr 

time 

40 

1.85e-001 

9.95e+000 

1.86e-001 

2.99e+001 

1.85e-001 

2.04e+001 

50 

1.68e-001 

1.65e+001 

1.68e-001 

4.55e+001 

1.69e-001 

2.47e+001 

60 

1.53e-001 

2.13e+001 

1.56e-001 

4.16e+001 

1.56e-001 

2.85e+001 

The  average  results  of  10  independent  trials  are  shown  in  Table  4.10.  We  can  see  that  APG-TC  produces 
much  more  accurate  solutions  within  less  time. 

4.3.  Summary.  Although  our  test  results  are  obtained  with  a  given  set  of  parameters,  it  is  clear  from 
the  results  that,  compared  to  the  existing  algorithms,  the  proposed  ones  can  return  solutions  of  similar  or 
better  quality  in  less  time.  Tuning  the  parameters  of  the  compared  algorithms  can  hardly  obtain  much 
improvement  in  both  solution  quality  and  time.  We  believe  that  the  superior  performance  of  the  proposed 
algorithms  is  due  to  the  use  of  prox-linear  steps,  which  are  not  only  easy  to  compute  but  also,  as  a  local 
approximate,  help  avoid  the  small  regions  around  certain  local  minima. 

5.  Conclusions.  We  have  proposed  a  block  coordinate  descent  method  with  three  choices  of  update 
schemes  for  multi-convex  optimization,  with  subsequence  and  global  convergence  guarantees  under  certain 
assumptions.  It  appears  to  be  the  first  globally  converging  algorithm  for  nonnegative  matrix/tensor  factor¬ 
ization  problems.  Numerical  results  on  both  synthetic  and  real  image  data  illustrate  the  high  efficiency  of 
the  proposed  algorithm. 
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NSF  grants  DMS-0748839  and  ECCS-1028790,  and  ONR  Grant  N00014-08- 1-1101.  The  authors  would  like 
to  thank  Zhi-Quan  (Tom)  Luo  for  very  help  discussions  and  sharing  his  manuscript  [57]. 

Appendix  A.  Proofs  of  Lemma  3  and  Theorem  3. 

A.l.  Proof  of  Lemma  2.6.  Without  loss  of  generality,  we  assume  F  =  0.  Otherwise,  we  can  consider 
F  —  F.  Let  R(x,p)  =  {x  :  ||x  —  x||  <  p}  C  U  for  some  p  >  0  where  U  is  the  neighborhood  of  x  in  (2.11)  with 
ip  =  F,  and  let  Lq  be  the  global  Lipschitz  constant  for  VXi/(x),  *  =  1,  ■  ■  ■  ,  s  within  B(x,  a/IO p),  namely, 

II VXi/(x)  -  VXi/(y)||  <  Lg||x  —  y ||,  i  =  1, •  •  •  ,  s 

for  any  x,y  £  R(x,  a/IOp).  In  addition,  let  C\  =  an<^  =  +  i-<5  Suppose 

Ci4>(F0  -F)  +  C2Vf0-F+  ||x°  -  x||  <  p, 
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(A.l) 


Table  4.8 

Comparison  results  on  Swimmer  database;  bold  are  large  error  or  slow  time. 


APG-TF  (proposed) 

AS-TF 

Blockpivot-TF 

r 

relerr 

time 

relerr 

time 

relerr 

time 

40 

2.43e-001 

2.01e+000 

2.71e-001 

2.09e+001 

2.53e-001 

2.50e+001 

50 

1.45e-001 

3.21e+000 

2.00e-001 

5.54e+001 

1.87e-001 

3.23e+001 

60 

3.16e-002 

6.91e+000 

1.10e-001 

3.55e+001 

7.63e-002 

3.74e+001 

Table  4.9 

Relative  errors  on  hyperspectral  data. 


APG-TF 

proposed) 

AS 

TF 

Blockpivot-TF 

r\T 

10 

25 

50 

100 

10 

25 

50 

100 

10 

25 

50 

100 

30 

2.56e-l 

2.53e-l 

2.53e-l 

2.53e-l 

2.60e-l 

2.56e-l 

2.54e-l 

2.53e-l 

2.60e-l 

2.56e-l 

2.54e-l 

2.53e-l 

40 

2.32e-l 

2.27e-l 

2.26e-l 

2.26e-l 

2.37e-l 

2.30e-l 

2.28e-l 

2.26e-l 

2.36e-l 

2.29e-l 

2.28e-l 

2.27e-l 

50 

2.14e-l 

2.07e-l 

2.04e-l 

2.04e-l 

2.20e-l 

2.11e-l 

2.07e-l 

2.06e-l 

2.17e-l 

2.10e-l 

2.07e-l 

2.05e-l 

60 

2.00e-l 

1.91e-l 

1.87e-l 

1.86e-l 

2.04e-l 

1.95e-l 

1.91e-l 

1.88e-l 

2. 01e-l 

1.94e-l 

1.90e-l 

1.88e-l 

which  quantifies  how  close  to  x  the  initial  point  needs  to  be.  We  go  to  show  xfc  £  B(x,p)  for  all  k  >  0  and 
the  estimation 

°°  O  I  J) 

Y  l|xfc+1-xfc||  <C1<?f.(Fiv-JF)  +  ||x7V-1-x7V-2||  +  T^||xAr-xJV-1||,VfV>2.  (A.2) 

k=N  1 


First,  we  prove  xfc  £  B(x,p)  by  induction  on  k.  Obviously,  x°  £  B(x,p).  For  k  =  1,  we  have  from  (2.7) 


that 


L? 


F0  >  F0  -  F,  >  ]T  -f  l|x?  -  >  -||x°  -  x1!!2. 


Hence,  ||x°  —  x1 1|  <  \/jFq,  and 


llx1  -  xll  <  ||x0-x1||  +  ||x°-x||  <  \!  -:Fn  +  1 1  X°  —  x| 


which  indicates  x1  £  B(x,p).  For  fc  =  2,  we  have  from  (2.7)  that  (regard  wf  =  0  for  i  £  X\  UI2) 

f,  >  Ft  -  >  £  fiK1  -  -  e  fwm  -  xjf . 

2—1  2—1 

Note  L\(lo})2  <  52£°  for  i  =  1,  ■  ■  ■  ,  s.  Thus,  it  follows  from  the  above  inequality  that 

<  Y  -xill2  <  F0+  y€l|x°  -x1!!2  <  (l+j5l)F0, 

2—1 

which  implies  Hx1  —  x2 1|  <  yj '2+^^  Fo-  Therefore, 


2=1 


llx2 -xll  <  ||x1-x2||  +  ||x1-x||  < 


x°  -  xll 


and  thus  x2  £  B(x,p). 
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Table  4.10 

Comparison  results  on  synthetic  nonnegative  tensor  completion;  bold  are  bad  or  slow. 


Problem  Setting 

APG-TC  (prop’d)  r  =  q 

APG-TC  (prop’d)  r  =  Ll.25gJ 

FaLRTC 

Ni 

n2 

n3 

Q 

SR 

relerr 

time 

relerr 

time 

relerr 

time 

80 

80 

80 

10 

0.10 

2.02e-004 

4.09e+000 

6.08e-004 

6.88e+000 

4.61e-001 

3.17e+001 

80 

80 

80 

10 

0.30 

1.18e-004 

2.52e+000 

3.29e-004 

5.85e+000 

1.96e-002 

1.96e+001 

80 

80 

80 

10 

0.50 

9.54e-005 

2.22e+000 

2.45e-004 

5.28e+000 

1.13e-002 

1.52e+001 

80 

80 

80 

20 

0.10 

1.50e-004 

9.55e+000 

4.84e-004 

1.60e+001 

4.41e-001 

2.47e+001 

80 

80 

80 

20 

0.30 

1.15e-004 

6.08e+000 

2.64e-004 

1.23e+001 

1.43e-001 

1.17e+001 

80 

80 

80 

20 

0.50 

9.65e-005 

5.01e+000 

1.72e-004 

1.27e+001 

1.46e-002 

1.95e+001 

80 

80 

80 

30 

0.10 

3.14e-003 

1.64e+001 

4.23e-004 

2.66e+001 

4.00e-001 

2.08e+001 

80 

80 

80 

30 

0.30 

1.04e-004 

1.12e+001 

1.94e-004 

2.11e+001 

2.22e-001 

8.12e+000 

80 

80 

80 

30 

0.50 

1.14e-004 

9.91e+000 

1.47e-004 

2.00e+001 

5.60e-002 

1.28e+001 

50 

50 

500 

10 

0.10 

2.76e-004 

1.16e+001 

4.69e-004 

2.03e+001 

5.52e-001 

1.83e+002 

50 

50 

500 

10 

0.30 

9.81e-005 

6.24e+000 

2.12e-004 

1.62e+001 

8.58e-002 

9.69e+001 

50 

50 

500 

10 

0.50 

9.51e-005 

5.34e+000 

1.74e-004 

1.63e+001 

1.25e-002 

9.63e+001 

50 

50 

500 

20 

0.10 

1.80e-004 

2.45e+001 

3.50e-004 

4.37e+001 

4.82e-001 

1.32e+002 

50 

50 

500 

20 

0.30 

3.95e-003 

1.34e+001 

1.59e-004 

3.91e+001 

2.76e-001 

5.82e+001 

50 

50 

500 

20 

0.50 

5.32e-003 

1.18e+001 

1.15e-004 

3.53e+001 

9.44e-002 

5.59e+001 

50 

50 

500 

30 

0.10 

7.09e-003 

3.90e+001 

5.08e-004 

6.76e+001 

4.32e-001 

1.18e+002 

50 

50 

500 

30 

0.30 

1.03e-004 

2.54e+001 

1.26e-004 

6.32e+001 

2.76e-001 

5.04e+001 

50 

50 

500 

30 

0.50 

3.28e-003 

2.30e+001 

1.03e-004 

5.56e+001 

1.62e-001 

4.30e+001 

Suppose  xfc  €  .B(x,p)  for  0  <  k  <  K.  We  go  to  show  xA+1  £  B(x,p).  For  k  <  K,  note 

-V/f  (xf)  +  VXi/(xfe)  e  dr^k)  +  VXi/(xfc),  *  e  Xi, 

-i?_1(x?  -  X?"1)  -  V/f  (xf)  +  VXi/(xfc)  €  dntf)  +  VXl/(xfc),  t  G  X2, 

-4_1(x?  -  i?"1)  -  V/f  (x?-1)  +  VXi/(xfe)  G  ^(x?)  +  VXl/(xfc),  t  G  X3, 


and 


9F(xfe)  =  (an(xf)  +  VXl/(xfc)}  x  •  •  •  x  {drs(x*)  +  Vx,/(xfc)}  , 
so  (for  i  £l1  UX2,  regard  x^_1  =  x^-1  in  x^  —  xf_1  and  xf”1  =  x^  in  V/f  (x^_1)  —  VXi/(xfc),  respectively) 


dist  (0, 5F(xfc))  <  ||  (Lf-^xJ  -  x^-1),  ■  ■  •  ,  Lk~\x.ks  -  x^"1))  ||  +  £  ||  V/ftx*"1)  -  Vx,/(xfc)||  .  (A.3) 

i=l 

For  the  first  term  in  (A.3),  plugging  in  x.f”1  and  recalling  Lk_1  <  L,  w,f_1  <  1  for  i  =  1,  •  •  •  ,  s,  we  can  easily 
get 


(£i  (x?  -  xj"1),  •  •  •  j  Xg_1(Xg  -  x*"1))  ||  <  L  (||xfc  -  x*-1!!  + 


k- 1  „fc-2| 


X  —  X 


(A.4) 


For  the  second  term  in  (A.3),  it  is  not  difficult  to  verify  (xk,  ■  ■  ■  ,xk_ltx.k  \ 
addition,  note  V/ftxJ"1)  =  VXi/  (x{,  •  •  ■  ,xti,X;“\  ■  •  •  ,x.k~1)  .  Hence, 


rk— 


G  £(x,  VlO p).  In 


£U  II Vxj^xf-1)  -  Vxj(xfc)||  <  EU  ^  II  (x 


-k- 1 


Jfc-1 


1  ’  7  1  7  7  7 


<  sLg  (||xfc  -x' 


k- 1| 


(A.5) 


Combining  (A.3),  (A.4)  and  (A.5)  gives 


dist(0,aF(xfc))  <{L  +  sLg )  (||xte  -  x^-1!!  +  ||xfe-1  -  xfc-2' 
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which  together  with  the  KL  inequality  (2.11)  implies 


t'm  >(L  +  sLg r1  (||xfc  -  x"-1!!  +  llx"-1  -  xfe-2||)  1 . 

Note  that  (f>  is  concave  and  4>'(Fk)  >  0.  Thus  it  follows  from  (2.7)  and  (A. 6)  that 

eu  (m  x?  -  x?+1n2  -  x?n2) 


(A.6) 


<t>(Fk)  -  <f>(Fk+i)  >  (j>'{Fk)(Fk  -  Fk+1)  > 


2 (L  +  sLq )  (||xfc  —  xfc— 1 1|  +  ||xfe_1  —  vfc_2l 


or  equivalently 


2  —  1 


fc||Xfc  _  Vfc+1ll2 


X7  -  x;  '  dr  <2 (L  +  sLg)  (||xfc  -  xfc-1||  +  Hx*"1  -  xfc"2 


||)(</)(Ffc)-0(Ffe+1)) 


+  E^-1<5SI|x?"1-x 


fe  II  2 


i=l 


Recalling  l  <  lk  1  <  £k  <  Lk  <  L  for  all  i,  k,  we  have  from  the  above  inequality  that 


fc+l||2  <  2(L  +  sLg ) 


cfc_xfe-i||  +  ||xfe-i_xfc-2 


!||)  (<f>{Fk)  -  0(Ffe+1))  +  52  Hx^-1  -  xfe||2.  (A. 7) 


Using  inequalities  a2  +  b2  <  (a  +  b)2  and  ab  <  ta2  +  for  t  >  0,  we  get  from  (A. 7)  that 


|xfc  -xfc+1| 


<(||xfc-xfc-1| 


Hxfc-r  _xfe-2|it2 


I  (2 (L  +  sLg) 

V  t 


(<j>(Fk)  -  </>{Fk+1))  )  +  <L||xfc_1  —  xkl 


<- 


.1  -& 


"  ^'xfc  -x*-1! 


n/c-r  _xfc-2| 


3(L  +  sLg) 

2*(1  -  <L) 


(^(Ffc)-^(Ffe+1))  +  <5a,||xft-1-xfc||, 


or  equivalently 

3||xfc  -  xfc+1||  <  (1  +  2<U||xfc  -  x*-1!!  +  (1  -  -  xfe-2||  + 

Summing  up  (A. 8)  over  k  from  2  to  AT  and  doing  some  eliminations  give 
K 

E(1  -  ^)llxfc  -  xfc+1H  +  (2  +  ^)llx*  -  x^+1H  +  (1  -  <U||x 

k= 2 


(0(Ffc)-^(Ffc+1)).  (A. 8) 


K-i_xK\ 


<(1  -  6u)\\x°  -  xl  +  (2  +  «y  Ux1  -  x2||  +  ^(^)  -  <KFk+i)). 
Recalling  ||x°  —  x1 1|  <  yj jFq  and  Hx1  —  x2||  <  \J ' 2+2(5“  Fo ,  we  have  from  the  above  inequality  that 
||x*+1-x||<f>*-xfc+1| 


llx2  -  xll 


fc= 2 


<\  -Fn 


2  +  6U  2  +  2  51 


< 


1-5, 

9  (L  +  sLg) 


Fn 


9  (L  +  sLg) 
2£(l  -S^)2 


{<l>(F2)-tf>(FK+1))  +  ||x2-x| 


2^(1  -  <L) 2 


2  +  2  <5 2 


\f~Fo  +  ||x°  — 


x  . 


Hence,  xA+1  £  B(x,p),  and  this  completes  the  proof  of  xfe  £  B(x,p)  for  all  k  >  0. 
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Secondly,  we  prove  (A. 2)  from  (A. 8).  Indeed,  (A. 8)  holds  for  all  k  >  0.  Summing  it  over  k  from  N  to  T 
and  doing  some  eliminations  yield 

(X  “  <UI!xfc  -  xfc+1H  +  (2  +  <UllxT  -  xT+1H  +  (!  -  ^)||xT_1  -  xT|| 


k—N 


m- i  _ jv ii  ,  9(L  +  sLg) 


<(1  - <U ||x^-2  - x^-1!!  +  (2  +  ^)||x^-1  - x^H  +  ^e{i-Z)WFN)  ~ </>(Ft+i))’ 


which  implies 


k  fc+ln  ^  ii  AT— 2  „JV— 1  it  ,  2  +  <L  n  jv— 1  „JVn  i  9 (L  +  sLq) 


^!|xfc-xfc+1H<||x 


k=N 


1  -  A 


X  —  X 


2*(1  -  5W): 


by  letting  T  — >  oo.  This  completes  the  proof. 

A. 2.  Proof  of  Theorem  2.9.  If  9  =  0,  we  must  have  F(xk°)  =  F(x)  for  some  ko.  Otherwise, 
F(xk)  >  F(x)  for  all  sufficiently  large  k.  The  Kurdyka-Lojasiewicz  inequality  gives  c  •  dist(0,  dF(xk))  >  1 
for  all  k  >  0,  which  is  impossible  since  xk  — >  x  and  0  £  dF(x).  The  finite  convergence  now  follows  from  the 
fact  that  F(xk°)  =  F(5c)  implies  xfc  =  xfc°  =  x  for  all  k  >  ko- 

For  9  £  (0, 1),  we  assume  F(xk)  >  F(x)  =  0  and  use  the  same  notation  as  in  the  proof  of  Lemma  3. 
Define  Sk  =  Y2iLk  llx*  ~  xl+1||.  Then  (A. 2)  can  be  written  as 

Sk  <  CMFk)  +  -  Sk)  +  Sk- 2  -  Sk-!,  for  k  >  2, 

1  Oto 

which  implies 

Sk  <  Ci<j>(Fk)  +  P^(Sk- 2  -  Sk),  for  k  >  2,  (A.9) 

1  Ooj 

since  Sk- 2  —  Sk- 1  >  0.  Using  </>(s)  =  cs1_£>,  we  have  from  (A. 6)  for  sufficiently  large  k  that 
c(l  -  9)(Fk)~0  >(L  +  sLG )-1  (||xfc  -  x*-1!!  +  Hx^1  -  xfc-2||)_1 , 
or  equivalently  ( Fk)e  <  c(  1  —  9)(L  +  sLG)(Sk- 2  —  Sk)-  Then, 

4>{Fk)  =  c(Fk)1-6  <  c(c(l  -9)(L  +  sLG){Sk-2  -  Sk))  ^ .  (A.10) 

Letting  C3  =  Cic(c(l  —  9)(L  +  sLq))  e  and  C4  =  we  have  from  (A.9)  and  (A. 10)  that 

Sk  <  c3  (Sk- 2  -  Sk)1^  +  c4  ( Sk-2  -  sk) .  (A.ll) 


When  9  £  (0,  |],  i.e. ,  >  1,  (A.ll)  implies  that  Sk  <  (C3  +  C4,)(Sk-2  —  Sk)  for  sufficiently  large  k 

since  Sk- 2  —  Sk  — >  0,  and  thus  Sk  <  j^r^f^Sk-2-  Note  that  ||xfc  —  x||  <  Sk-  Therefore,  item  2  holds  with 

r  =  \J  i+ct+C4  <  1  and  sufficiently  large  C. 

When  9  £  (|,  1),  i.e.,  <  1,  we  get 


SVN  +  S"N_!  -  S"K+1  -SVK>  -  K),  (A. 12) 

for  v  =  <  0,  some  constant  [i  >  0  and  any  A  >  K  with  sufficiently  large  K  by  the  same  argument  as 

in  the  proof  of  Theorem  2  of  [3].  Note  Sn  <  <Sjv-i  and  v  <  0.  Hence,  (A. 12)  implies 

1 

Sn  <  Q  (S£+1  +  SVK  +  M(A  -  A)))  "  <  CA"^, 
for  sufficiently  large  C  and  A.  This  completes  the  proof. 
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